class: center, middle, inverse, title-slide

# Branching Out Into Isolation Forests
## R-Ladies Dallas
### Stephanie Kirmer
www.stephaniekirmer.com | @data_stephanie
### December 7, 2020

---

# Follow Along!

https://github.com/skirmer/isolation_forests

---

# Introduction

Isolation forests are a method that uses tree-based decision-making to separate observations instead of grouping them.

You might visualize this in tree form:

<img src="../IsolationForest1.png" alt="diagram1" width="600"/>

---

# Introduction

If you prefer to think about the points in two-dimensional space, you can also use something like this:

Here you can see that a highly anomalous observation is easily separated from the bulk of the sample, while a non-anomalous one requires many more steps to isolate.

---

# Getting Started

Today we are going to implement this modeling approach using a sample of data from Spotify: song characteristics.

We'll be using these libraries:

* **modeling**: isotree, fastshap
* **visuals**: ggplot2, plotly, patchwork

---

# Load Data

Our dataset: Spotify Tracks (via Kaggle)
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv

## Track Characteristics

```
## [1] "acousticness" "artists" "danceability" "duration_ms"
## [5] "energy" "explicit" "id" "instrumentalness"
## [9] "key" "liveness" "loudness" "mode"
## [13] "name" "popularity" "release_date" "speechiness"
## [17] "tempo" "valence" "year"
```

---

# Looking at Examples

.panelset[
.panel[.panel-name[Instrumental]

```r
knitr::kable(head(dataset[dataset$instrumentalness > .94, c("artists", "name", "year")], 5))
```

|   |artists                                    |name                                        | year|
|:--|:------------------------------------------|:-------------------------------------------|----:|
|21 |['Moritz Moszkowski', 'Vladimir Horowitz'] |Etude in A-Flat, Op. 72, No. 11             | 1928|
|23 |['Frédéric Chopin', 'Vladimir Horowitz']   |Andante spianato in E-Flat Major, Op. 22    | 1928|
|27 |['Hafız Yaşar']                            |Kız Saçların                                | 1928|
|42 |['Dmitry Kabalevsky', 'Vladimir Horowitz'] |Sonata No. 3, Op. 46: II. Andante cantabile | 1928|
|49 |['Shungi Music Crew']                      |Rumours                                     | 1928|
]
.panel[.panel-name[Speechy]

```r
set.seed(426)
speechy = dataset[dataset$speechiness > .9 & dataset$year > 1965, c("artists", "name", "year")]
knitr::kable(speechy[sample(nrow(speechy), 3), ])
```

|      |artists          |name                               | year|
|:-----|:----------------|:----------------------------------|----:|
|23123 |['John Mulaney'] |Blacking Out and Making Money      | 2009|
|46975 |['John Mulaney'] |Law and Order and Mr. Jerry Orbach | 2009|
|76943 |['John Mulaney'] |Crime News                         | 2009|
]
.panel[.panel-name[Loud]

```r
set.seed(400)
loud = dataset[dataset$loudness > .85, c("artists", "name", "year")]
knitr::kable(loud[sample(nrow(loud), 3), ])
```

|       |artists          |name                              | year|
|:------|:----------------|:---------------------------------|----:|
|93132  |['The Stooges']  |Search and Destroy - Iggy Pop Mix | 1973|
|108250 |['Apocolothoth'] |Sold                              | 1936|
|127899 |['DYING SPASM']  |drag                              | 1944|
]]

---

# Feature Engineering

* Bin the years
* Cut off songs pre-1960

|Var1           |  Freq|
|:--------------|-----:|
|60s            | 20000|
|70s            | 20000|
|80s            | 20000|
|90s            | 20000|
|00s            | 20000|
|10s to present | 19656|

---

# Function Syntax

We don't need to set any outcome or dependent variable, because that is not the objective of this algorithm.

.panelset[
.panel[.panel-name[Calling Function]

```r
iso_ext = isolation.forest(
  training_set[, features],
  ndim = 1,
  ntrees = 100,
  max_depth = 6,
  prob_pick_pooled_gain = 0,
  prob_pick_avg_gain = 0,
  output_score = FALSE)

Z1 <- predict(iso_ext, training_set)
Z2 <- predict(iso_ext, test_set)

training_set$scores <- Z1
test_set$scores <- Z2
```
]
.panel[.panel-name[Fitting Options]

You also have a few options in how you fit the model.
You can:

* `output_score = TRUE` - return "outlierness" scores for the training set
* `output_dist = TRUE` - return the pairwise distance between points (e.g., the degree of difference between any two; it takes time to run)
* return a model object to be used for predicting on new data (default)

<img src="../tree_sample.png" alt="diagram2" width="400"/>
]
.panel[.panel-name[Model Summary Data]

These are examples of the available output from returning a model object.

```r
summary(iso_ext)
```

```
## Isolation Forest model
## Consisting of 100 trees
## Numeric columns: 14
```
]]

---

# Model Tuning

Some of the hyperparameters will be very familiar from other kinds of tree-based models. These are others that might be worth tuning for your modeling.

* `prob_pick_pooled_gain`: a higher value fits closer to the training set, but with a risk of overfitting
* `prob_pick_avg_gain`: a higher value is more likely to set outlier bounds outside the training set range (less overfitting, but worse training performance)

When a split is created in a tree, it's random by default. These arguments change that, increasing the probability that the split gives the largest average or pooled gain. If you pass 1 to either, that creates a deterministic tree.

---

# "Ghosting" Issue

One more hyperparameter to look at: `ndim` indicates the number of columns to combine to produce a split.

Having multiple clusters of non-anomaly points can create problems when using isolation forests, as shown here. This is the issue that the Extended Isolation Forest (`ndim` > 1) is meant to remedy.
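A minimal sketch of this contrast, on simulated data rather than the Spotify set: two well-separated clusters plus one true anomaly sitting midway between them, scored by a single-variable (`ndim = 1`) and an extended (`ndim = 2`) isolation forest. The data and variable names here are made up for illustration.

```r
library(isotree)

set.seed(42)
# Two tight Gaussian clusters of 200 points each, far apart in 2D space
clusters <- rbind(
  matrix(rnorm(400, mean = -4, sd = 0.5), ncol = 2),
  matrix(rnorm(400, mean =  4, sd = 0.5), ncol = 2)
)
# One anomalous point sitting between the clusters - the "ghost" region
midpoint <- matrix(c(0, 0), ncol = 2)
X <- as.data.frame(rbind(clusters, midpoint))

# Standard isolation forest: axis-aligned, single-variable splits
iso_std <- isolation.forest(X, ndim = 1, ntrees = 100)
# Extended isolation forest: each split combines two variables
iso_ext2 <- isolation.forest(X, ndim = 2, ntrees = 100)

# Higher score = more anomalous; the extended forest should rank the
# midpoint well above the cluster points
score_std <- predict(iso_std, X)
score_ext <- predict(iso_ext2, X)
score_ext[nrow(X)]          # score of the between-cluster point
median(score_ext[1:400])    # typical score of a cluster point
```

With `ndim = 1` the region between the clusters can pick up artificially low scores (the ghosting artifact in the figure below); with `ndim = 2` the splits are oblique, so that region scores as anomalous as it should.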
<img src="isoforests_files/figure-html/unnamed-chunk-10-1.png" width="700" />

---

# Feature Importance

<!-- -->

---

# Peeking at Results

.panelset[
.panel[.panel-name[Table of Tracks]

|       |artists                             |name                                       | year|    scores|
|:------|:-----------------------------------|:------------------------------------------|----:|---------:|
|92291  |['Herb Alpert & The Tijuana Brass'] |Whipped Cream                              | 1965| 0.4667073|
|7231   |['Spawnbreezie']                    |Don't Let Go                               | 2011| 0.4455175|
|135399 |['ILLENIUM', 'Jon Bellion']         |Good Things Fall Apart (with Jon Bellion)  | 2019| 0.4416365|
|61001  |['Bonnie "Prince" Billy']           |I See A Darkness                           | 1998| 0.4699742|
|10548  |['Sammy Davis Jr.']                 |Not for Me                                 | 1964| 0.4343238|
|12461  |['The The']                         |Soul Mining                                | 1983| 0.4632189|
]
.panel[.panel-name[Score Distribution]

<img src="isoforests_files/figure-html/unnamed-chunk-13-1.png" width="600" />
]
.panel[.panel-name[Scatterplot (Training)]

<img src="isoforests_files/figure-html/unnamed-chunk-14-1.png" width="650" />
]
.panel[.panel-name[Scatterplot (Test)]

<img src="isoforests_files/figure-html/unnamed-chunk-15-1.png" width="600" />
]]

---

# PCA

.panelset[
.panel[.panel-name[Component Choices]

<img src="isoforests_files/figure-html/unnamed-chunk-16-1.png" width="650" />
]
.panel[.panel-name[3D Rendering]
]
.panel[.panel-name[3D Outliers]
]
.panel[.panel-name[3D Outliers Named]
]
]

---

# Other Exploration

.panelset[
.panel[.panel-name[Decade Proportions]

<img src="isoforests_files/figure-html/unnamed-chunk-23-1.png" width="700" />
]
.panel[.panel-name[Score Density]

<img src="isoforests_files/figure-html/unnamed-chunk-24-1.png" width="700" />
]]

---

# Further Links/Reference

https://ggplot2.tidyverse.org/

https://plotly.com/r/3d-scatter-plots/

https://github.com/david-cortes/isotree

https://github.com/david-cortes/outliertree (A different but related algorithm for tree-based outlier identification)

https://towardsdatascience.com/outlier-detection-with-extended-isolation-forest-1e248a3fe97b (More on the ghosting problem)

https://arxiv.org/pdf/1811.02141.pdf

---

# Thank you!

[www.stephaniekirmer.com](http://www.stephaniekirmer.com) | @[data_stephanie](http://www.twitter.com/data_stephanie) | [saturncloud.io](http://saturncloud.io)